Can you get rich by doing machine learning?¶
*Let data help you become a talent in the field of machine learning*¶
1. Introduction¶
Machine learning is a quintessentially interdisciplinary subject and a popular direction for our future employment. It is developing rapidly and has excellent prospects, which makes it extremely attractive to us.
But I believe everyone has noticed a problem: what is the relationship between the courses we take every day and the work we will do in the future? We study programming languages, databases, all kinds of model concepts, even big data, trying to develop ourselves into well-rounded talents, but which of it will actually prove useful in the end?
Becoming a talent cannot ignore practical questions, and I believe everyone is also curious how much money you can earn once you become a talent in this field. So we decided to use this "insider" data set, the Kaggle 2022 survey, to dig into the questions everyone wonders about.
2. Data loading and data information display¶
2-1 Importing third-party libraries¶
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U plotly
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pip in d:\anaconda\lib\site-packages (22.3.1)
Requirement already satisfied: plotly in d:\anaconda\lib\site-packages (5.11.0)
Requirement already satisfied: tenacity>=6.2.0 in d:\anaconda\lib\site-packages (from plotly) (8.1.0)
Note: you may need to restart the kernel to use updated packages.
pip install missingno
Collecting missingno
  Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB)
Installing collected packages: missingno
Successfully installed missingno-0.5.2
Note: you may need to restart the kernel to use updated packages.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder
import missingno as msno
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
2-2 Reading Data¶
First, we need a general sense of the data set: how many people in the industry filled out the questionnaire, and what questions they were asked.
data = pd.read_csv('kaggle_survey_2022_responses.csv')
2-3 View basic data information¶
print("data shape =",data.shape)
data shape = (23998, 296)
print("Show the first five rows:")
data.head()
Show the first five rows:
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | Are you currently a student? (high school, uni... | On which platforms have you begun or completed... | On which platforms have you begun or completed... | On which platforms have you begun or completed... | On which platforms have you begun or completed... | On which platforms have you begun or completed... | ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... | Who/what are your favorite media sources that ... |
| 1 | 121 | 30-34 | Man | India | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 462 | 30-34 | Man | Algeria | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 293 | 18-21 | Man | Egypt | Yes | Coursera | edX | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | NaN | NaN | NaN | NaN |
| 4 | 851 | 55-59 | Man | France | No | Coursera | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | Course Forums (forums.fast.ai, Coursera forums... | NaN | NaN | Blogs (Towards Data Science, Analytics Vidhya,... | NaN | NaN | NaN | NaN |
5 rows × 296 columns
print("Last five rows:")
data.tail()
Last five rows:
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23993 | 331 | 22-24 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | Journal Publications (peer-reviewed journals, ... | NaN | NaN | NaN |
| 23994 | 330 | 60-69 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23995 | 860 | 25-29 | Man | Turkey | No | NaN | NaN | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23996 | 597 | 35-39 | Woman | Israel | No | NaN | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23997 | 303 | 18-21 | Man | India | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Other |
5 rows × 296 columns
data.dtypes
Duration (in seconds) object
Q2 object
Q3 object
Q4 object
Q5 object
...
Q44_8 object
Q44_9 object
Q44_10 object
Q44_11 object
Q44_12 object
Length: 296, dtype: object
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23998 entries, 0 to 23997
Columns: 296 entries, Duration (in seconds) to Q44_12
dtypes: object(296)
memory usage: 54.2+ MB
# Summary statistics of the data set
# count  - number of non-null values in each column
# unique - number of distinct values
# top    - the value with the most occurrences
# freq   - how often the top value occurs
data.describe()
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23998 | 23998 | 23998 | 23998 | 23998 | 9700 | 2475 | 6629 | 3719 | 945 | ... | 2679 | 11182 | 4007 | 11958 | 2121 | 7767 | 3805 | 1727 | 1 | 836 |
| unique | 4329 | 12 | 6 | 59 | 3 | 2 | 2 | 2 | 2 | 2 | ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 2 |
| top | 230 | 18-21 | Man | India | No | Coursera | edX | Kaggle Learn Courses | DataCamp | Fast.ai | ... | Reddit (r/machinelearning, etc) | Kaggle (notebooks, forums, etc) | Course Forums (forums.fast.ai, Coursera forums... | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | Blogs (Towards Data Science, Analytics Vidhya,... | Journal Publications (peer-reviewed journals, ... | Slack Communities (ods.ai, kagglenoobs, etc) | Who/what are your favorite media sources that ... | Other |
| freq | 59 | 4559 | 18266 | 8792 | 12036 | 9699 | 2474 | 6628 | 3718 | 944 | ... | 2678 | 11181 | 4006 | 11957 | 2120 | 7766 | 3804 | 1726 | 1 | 835 |
4 rows × 296 columns
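To make count / unique / top / freq concrete, here is a minimal sketch (toy data of our own, not from the survey) of what `describe()` reports for a text column:

```python
import pandas as pd

# Toy object column: five responses, one of them missing
s = pd.Series(["Python", "Python", "R", None, "Python"], name="lang")
stats = s.describe()

# count  = non-null values, unique = distinct values,
# top    = most frequent value, freq = its number of occurrences
print(stats["count"], stats["unique"], stats["top"], stats["freq"])  # 4 2 Python 3
```

The same four rows are what `data.describe()` produces above, just for all 296 object columns at once.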
3. Data analysis¶
Now we come to the data analysis module. This much raw data makes one dizzy; if we want to find the content we need, we have to visualize the data to see the information more intuitively. First, we drop row 0, which holds the question text rather than an actual response.
data_1 = data.drop([0])
data_1
| | Duration (in seconds) | Q2 | Q3 | Q4 | Q5 | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | ... | Q44_3 | Q44_4 | Q44_5 | Q44_6 | Q44_7 | Q44_8 | Q44_9 | Q44_10 | Q44_11 | Q44_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 121 | 30-34 | Man | India | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 462 | 30-34 | Man | Algeria | No | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 293 | 18-21 | Man | Egypt | Yes | Coursera | edX | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | NaN | NaN | NaN | NaN |
| 4 | 851 | 55-59 | Man | France | No | Coursera | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | Course Forums (forums.fast.ai, Coursera forums... | NaN | NaN | Blogs (Towards Data Science, Analytics Vidhya,... | NaN | NaN | NaN | NaN |
| 5 | 232 | 45-49 | Man | India | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Blogs (Towards Data Science, Analytics Vidhya,... | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23993 | 331 | 22-24 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | Podcasts (Chai Time Data Science, O’Reilly Dat... | NaN | Journal Publications (peer-reviewed journals, ... | NaN | NaN | NaN |
| 23994 | 330 | 60-69 | Man | United States of America | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23995 | 860 | 25-29 | Man | Turkey | No | NaN | NaN | NaN | DataCamp | NaN | ... | NaN | Kaggle (notebooks, forums, etc) | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23996 | 597 | 35-39 | Woman | Israel | No | NaN | NaN | Kaggle Learn Courses | NaN | NaN | ... | NaN | NaN | NaN | YouTube (Kaggle YouTube, Cloud AI Adventures, ... | NaN | NaN | NaN | NaN | NaN | NaN |
| 23997 | 303 | 18-21 | Man | India | Yes | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Other |
23997 rows × 296 columns
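A small aside: since row 0 is the question text, it could be kept as a lookup table before being dropped, and once it is gone the duration column can be made numeric again (the whole file was read as `object` precisely because of that text row). A sketch on a tiny mock frame with the same layout (the names `raw`, `questions`, and `answers` are ours):

```python
import pandas as pd

# Tiny mock of the survey layout: row 0 is question text, later rows are answers
raw = pd.DataFrame({
    'Duration (in seconds)': ['Duration (in seconds)', '121', '462'],
    'Q2': ['What is your age (# years)?', '30-34', '30-34'],
})

questions = raw.iloc[0].to_dict()   # code -> question-text lookup
answers = raw.drop([0])             # same idea as data_1 = data.drop([0])

# With the text row gone, the duration column can be numeric again
answers['Duration (in seconds)'] = pd.to_numeric(answers['Duration (in seconds)'])
print(questions['Q2'])                             # What is your age (# years)?
print(answers['Duration (in seconds)'].mean())     # 291.5
```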
3-1 Distribution map of personal information¶
3-1-1 Age (Q2) distribution¶
In the computer field, age is capital; the younger you are, the more miracles you can create.
Num_Q2 = data_1["Q2"].value_counts()
print(Num_Q2)
#Presented in pie chart form
#hole: Set the hollow radius ratio [0,1]
#template: Canvas style, there are several options: ggplot2, seaborn, simple_white, plotly, plotly_white
fig=px.pie(values=Num_Q2.values, names=Num_Q2.index, hole=0.7, template='plotly_white')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.add_annotation(dict(x=0.5, y=0.5, align='center',
xref = "paper", yref = "paper",
showarrow = False, font_size=22,
text="<b>Age</b>"))
fig.show()
Q2
18-21    4559
25-29    4472
22-24    4283
30-34    2972
35-39    2353
40-44    1927
45-49    1253
50-54     914
55-59     611
60-69     526
70+       127
Name: count, dtype: int64
3-1-2 Gender (Q3) distribution¶
In the computer field men do make up the majority, but women can create immeasurable value just the same.
Num_Q3 = data_1["Q3"].value_counts()
print(Num_Q3)
fig=px.pie(values=Num_Q3.values, names=Num_Q3.index, hole=0.7, template='ggplot2')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.add_annotation(dict(x=0.5, y=0.5, align='center',
xref = "paper", yref = "paper",
showarrow = False, font_size=22,
text="<b>Gender</b>"))
fig.show()
Q3
Man                        18266
Woman                       5286
Prefer not to say            334
Nonbinary                     78
Prefer to self-describe       33
Name: count, dtype: int64
3-1-3 Regional (Q4) distribution¶
Kaggle is a very successful platform; people all over the world use it for learning and communication. Ideas only give off greater sparks when they collide.
Num_Q4 = data_1["Q4"].value_counts()
print(Num_Q4)  # a pandas Series
print("---------------------------------------------------------------------")
Num_Q42=pd.DataFrame({'Country':Num_Q4.index,'Count':Num_Q4.values})
print(Num_Q42)
#There are too many countries, so they are presented in the form of a histogram
#x, y: horizontal and vertical coordinates color: used to add legends template: canvas style text: display numbers title: title
fig=px.bar(Num_Q42, x='Country', y='Count', color='Country', template='seaborn', text='Count', title='<b>Country</b>')
fig.show()
Q4
India 8792
United States of America 2920
Other 1430
Brazil 833
Nigeria 731
Pakistan 620
Japan 556
China 453
Egypt 383
Mexico 380
Indonesia 376
Turkey 345
Russia 324
South Korea 317
France 262
United Kingdom of Great Britain and Northern Ireland 258
Spain 257
Canada 257
Colombia 256
Bangladesh 251
Taiwan 242
Viet Nam 212
Argentina 204
Kenya 201
Italy 182
Morocco 177
Australia 142
Thailand 132
Tunisia 125
Peru 121
Iran, Islamic Republic of... 120
Chile 115
Poland 113
South Africa 109
Philippines 108
Netherlands 108
Ghana 107
Israel 102
Germany 99
Ethiopia 98
United Arab Emirates 94
Portugal 87
Saudi Arabia 84
Ukraine 79
Sri Lanka 77
Nepal 75
Malaysia 74
Singapore 68
Cameroon 68
Algeria 62
Hong Kong (S.A.R.) 58
Zimbabwe 54
Ecuador 54
Ireland 53
Belgium 51
Romania 50
Czech Republic 49
I do not wish to disclose my location 42
Name: count, dtype: int64
---------------------------------------------------------------------
Country Count
0 India 8792
1 United States of America 2920
2 Other 1430
3 Brazil 833
4 Nigeria 731
5 Pakistan 620
6 Japan 556
7 China 453
8 Egypt 383
9 Mexico 380
10 Indonesia 376
11 Turkey 345
12 Russia 324
13 South Korea 317
14 France 262
15 United Kingdom of Great Britain and Northern I... 258
16 Spain 257
17 Canada 257
18 Colombia 256
19 Bangladesh 251
20 Taiwan 242
21 Viet Nam 212
22 Argentina 204
23 Kenya 201
24 Italy 182
25 Morocco 177
26 Australia 142
27 Thailand 132
28 Tunisia 125
29 Peru 121
30 Iran, Islamic Republic of... 120
31 Chile 115
32 Poland 113
33 South Africa 109
34 Philippines 108
35 Netherlands 108
36 Ghana 107
37 Israel 102
38 Germany 99
39 Ethiopia 98
40 United Arab Emirates 94
41 Portugal 87
42 Saudi Arabia 84
43 Ukraine 79
44 Sri Lanka 77
45 Nepal 75
46 Malaysia 74
47 Singapore 68
48 Cameroon 68
49 Algeria 62
50 Hong Kong (S.A.R.) 58
51 Zimbabwe 54
52 Ecuador 54
53 Ireland 53
54 Belgium 51
55 Romania 50
56 Czech Republic 49
57 I do not wish to disclose my location 42
3-1-4 Job Positions (Q23) Distribution¶
I believe everyone is especially interested in this part: for those of us in the big-data field, what are the future career options?
Num_Q23 = data_1["Q23"].value_counts()
print(Num_Q23)
Num_Q23_2=pd.DataFrame({'Job':Num_Q23.index,'Count':Num_Q23.values})
# print(Num_Q23_2)
fig=px.bar(Num_Q23_2, x='Job', y='Count', color='Job', template='seaborn', text='Count', title='<b>Job</b>')
fig.show()
Q23
Data Scientist                                                      1929
Data Analyst (Business, Marketing, Financial, Quantitative, etc)    1538
Currently not employed                                              1432
Software Engineer                                                    980
Teacher / professor                                                  833
Manager (Program, Project, Operations, Executive-level, etc)         832
Other                                                                754
Research Scientist                                                   593
Machine Learning/ MLops Engineer                                     571
Engineer (non-software)                                              465
Data Engineer                                                        352
Statistician                                                         125
Data Architect                                                        95
Data Administrator                                                    70
Developer Advocate                                                    61
Name: count, dtype: int64
3-1-5 Distribution of work areas (Q24)¶
As mentioned in the introduction, machine learning is a typical interdisciplinary subject. So which specific fields does it involve?
Num_Q24 = data_1["Q24"].value_counts()
print(Num_Q24)
fig=px.pie(values=Num_Q24.values, names=Num_Q24.index, hole=0.4, template='plotly')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.add_annotation(dict(x=0.5, y=0.5, align='center',
xref = "paper", yref = "paper",
showarrow = False, font_size=22,
text="<b>Job Area</b>"))
fig.show()
Q24
Computers/Technology                      2321
Academics/Education                       1447
Accounting/Finance                         802
Other                                      750
Manufacturing/Fabrication                  561
Medical/Pharmaceutical                     509
Government/Public Service                  500
Online Service/Internet-based Services     461
Retail/Sales                               398
Energy/Mining                              320
Insurance/Risk Assessment                  256
Marketing/CRM                              246
Non-profit/Service                         194
Broadcasting/Communications                179
Shipping/Transportation                    150
Name: count, dtype: int64
3-1-6 Salary (Q29) distribution¶
This brings us to what we care about most: how much wealth can we create for ourselves? Looking at this distribution, we can't help wanting to find a position for ourselves. What factors would make us high-paid talents? That is the question we will discuss later.
Num_Q29 = data_1["Q29"].value_counts()
print(Num_Q29)
# print("---------------------------------------------------------------------")
Num_Q29_2=pd.DataFrame({'Salary':Num_Q29.index,'Count':Num_Q29.values})
fig=px.bar(Num_Q29_2, x='Salary', y='Count', color='Salary', template='seaborn', text='Count', title='<b>Salary</b>')
fig.show()
Q29
$0-999              1112
10,000-14,999        493
30,000-39,999        464
1,000-1,999          444
40,000-49,999        421
100,000-124,999      404
5,000-7,499          391
50,000-59,999        366
7,500-9,999          362
150,000-199,999      342
20,000-24,999        337
60,000-69,999        318
15,000-19,999        299
70,000-79,999        289
25,000-29,999        277
2,000-2,999          271
125,000-149,999      269
3,000-3,999          244
4,000-4,999          234
80,000-89,999        222
90,000-99,999        197
200,000-249,999      155
250,000-299,999       78
300,000-499,999       76
$500,000-999,999      48
>$1,000,000           23
Name: count, dtype: int64
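One caveat when plotting Q29: `value_counts()` orders the bars by frequency, while salary bins have a natural numeric order. A hedged sketch of how the bins could be kept in order with an ordered categorical (only a subset of bins shown for brevity; `salary_order` is our own name):

```python
import pandas as pd

# A few Q29 salary bins in their natural numeric order (subset for illustration)
salary_order = ['$0-999', '1,000-1,999', '5,000-7,499', '100,000-124,999', '>$1,000,000']

# Toy responses; an ordered Categorical remembers the bin order
s = pd.Categorical(['5,000-7,499', '$0-999', '>$1,000,000', '$0-999'],
                   categories=salary_order, ordered=True)

# Reindexing the counts by the bin order yields bars sorted by salary, not frequency
counts = pd.Series(s).value_counts().reindex(salary_order)
print(counts.tolist())  # [2, 0, 1, 0, 1]
```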
3-1-7 Educational background (Q8) distribution¶
Since the computer field produces many high-end talents, what education levels do respondents actually hold?
Num_Q8 = data_1["Q8"].value_counts()
print(Num_Q8)
fig=px.pie(values=Num_Q8.values, names=Num_Q8.index, hole=0.4, template='plotly_white')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.update_layout(
title={
"text":"<b>Academic Qualification<b>",
"y":0.96, # y
"x":0.4, # x
}
)
fig.show()
Q8
Master’s degree                                                      9142
Bachelor’s degree                                                    7625
Doctoral degree                                                      2657
Some college/university study without earning a bachelor’s degree    1431
I prefer not to answer                                               1394
Professional doctorate                                                585
No formal education past high school                                  564
Name: count, dtype: int64
3-2 ML&DS related data distribution diagram¶
We have just analyzed the basic profile of industry insiders; now we come to a more detailed analysis of the data most relevant to the profession itself.
When programming every day, you may wonder: will the software I use now, the languages I write, the platforms I run on, the libraries I call, and the algorithms I apply still be of use in the future?
To put it more simply: when we first entered school, we all wondered why we had to learn advanced mathematics. We don't need it to do the arithmetic when we go out to buy vegetables, so do we really need calculus to price bean sprouts?
To resolve such doubts, let's look at how the veterans, with their varying years of programming experience, answered the questions above.
3-2-1 Distribution of time spent using machine learning methods (Q16)¶
Num_Q16 = data_1["Q16"].value_counts()
print(Num_Q16)
fig=px.pie(values=Num_Q16.values, names=Num_Q16.index, hole=0.4, template='ggplot2')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(
title={
"text":"Time distribution using machine learning methods",
"y":0.96, # y
"x":0.4, # x
}
)
fig.show()
Q16
Under 1 year                             7221
1-2 years                                3720
I do not use machine learning methods    3419
2-3 years                                1947
5-10 years                               1090
3-4 years                                1053
4-5 years                                 950
10-20 years                               483
20 or more years                            3
Name: count, dtype: int64
3-2-2 Programming time (Q11) distribution¶
Num_Q11 = data_1["Q11"].value_counts()
print(Num_Q11)
fig=px.pie(values=Num_Q11.values, names=Num_Q11.index, hole=0.4, template='ggplot2')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.update_layout(
title={
"text":"Programming time distribution",
"y":0.96, # y
"x":0.45, # x
}
)
fig.show()
Q11
1-3 years                    6459
< 1 years                    5454
3-5 years                    3399
5-10 years                   2556
I have never written code    2037
10-20 years                  1801
20+ years                    1537
Name: count, dtype: int64
3-2-3 Usage of learning platform (Q6)¶
Let's first take a look at which of these learning platforms are the most popular, so that in the future we can open up a new learning platform besides CSDN and learn more applicable knowledge.
data_1["Q6_1"].value_counts() #9699
Q6_1
Coursera    9699
Name: count, dtype: int64
platform=['Coursera','edX','Kaggle Learn Courses','DataCamp','Fast.ai','Udacity','Udemy','LinkedIn Learning','Cloud-certification programs','University Courses','None','Other']
count=[9699,2474,6628,3718,944,2199,6116,2766,1821,6780,2643,5669]
df2=pd.DataFrame({'Platform':platform,'Count':count})
df2=df2.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(df2,x='Platform',y='Count',color='Platform',text='Count',template='simple_white',title='<b>Platforms used by Kagglers for completing Data Science Courses</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
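The `count` list above was typed in by hand from the individual `Q6_*` outputs. Since each `Q6_*` column holds one option (non-null exactly when the respondent ticked it), the tallies could also be derived programmatically. A minimal sketch on a mock frame that follows the survey's `Q6_*` layout (our illustration, not the notebook's code):

```python
import pandas as pd

# Mock of two multi-select columns: non-null means the respondent ticked that option
mock = pd.DataFrame({
    'Q6_1': ['Coursera', None, 'Coursera'],
    'Q6_2': [None, 'edX', None],
})

# Each column represents one option, so its non-null count is that option's tally
counts = {col: int(mock[col].notna().sum()) for col in mock.columns}
# The option text is simply the column's (single) non-null value
labels = {col: mock[col].dropna().iloc[0] for col in mock.columns}

print({labels[c]: counts[c] for c in mock.columns})  # {'Coursera': 2, 'edX': 1}
```

On the real data the same two lines over `data_1.filter(like='Q6_')` would reproduce the hardcoded list.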
The figure shows that these four learning platforms are the most popular. Next, let's look at how they are distributed between "newbies" and "veterans".
a1=data_1.groupby(['Q6_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['Platform']=a1['Q6_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
a2=data_1.groupby(['Q6_10','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['Platform']=a2['Q6_10']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q6_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['Platform']=a3['Q6_3']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q6_7','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['Platform']=a4['Q6_7']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Coursera</em>','<em>University Courses</em>','<em>Kaggle Learn Courses</em>','<em>Udemy</em>'))
# Filter each frame first so that x, y, and text all come from the same rows
b1=a1[a1['Platform']=='Coursera']
b2=a2[a2['Platform']=='University Courses (resulting in a university degree)']
b3=a3[a3['Platform']=='Kaggle Learn Courses']
b4=a4[a4['Platform']=='Udemy']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Coursera',text=b1['Count']),row=1,col=1)
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='University Courses',text=b2['Count']),row=1,col=2)
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='Kaggle Learn Courses',text=b3['Count']),row=2,col=1)
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='Udemy',text=b4['Count']),row=2,col=2)
fig.show()
3-2-4 Programming language (Q12) usage¶
The major programming languages run through almost our entire first two years of study. Perhaps you have also wondered: are these languages actually in common use?
pl=['Python','R','SQL','C','C#','C++','Java','Javascript','Bash','PHP','MATLAB','Julia','Go','None','Other']
count=[18653,4571,9620,3801,1473,4549,3862,3489,1674,1443,2441,296,322,256,1342]
df7=pd.DataFrame({'Programming Language':pl,'Count':count})
df7.sort_values(by='Count',ascending=False,inplace=True)
df7.reset_index(drop=True)
fig=px.bar(df7,x='Programming Language',y='Count',color='Programming Language',template='simple_white',text='Count',title='<b>What programming languages do Kagglers use on a regular basis?</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
The visualization of question Q12 shows that however much we suffer while learning all these languages, the most used language at present is Python. If you want to become a talent, you still have to write Python.
a1=data_1.groupby(['Q12_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['Language']=a1['Q12_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
a2=data_1.groupby(['Q12_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['Language']=a2['Q12_3']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q12_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['Language']=a3['Q12_2']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q12_6','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['Language']=a4['Q12_6']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Python</em>','<em>SQL</em>','<em>R</em>','<em>C++</em>'))
# Filter each frame first so that x, y, and text all come from the same rows
b1=a1[a1['Language']=='Python']
b2=a2[a2['Language']=='SQL']
b3=a3[a3['Language']=='R']
b4=a4[a4['Language']=='C++']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Python',text=b1['Count']),row=1,col=1)
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='SQL',text=b2['Count']),row=1,col=2)
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='R',text=b3['Count']),row=2,col=1)
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='C++',text=b4['Count']),row=2,col=2)
fig.show()
Let's look at how programmers with different amounts of programming experience have mastered these languages. We find that Python is indispensable, especially among programmers who have only been coding for a short time.
3-2-5 Integrated development environment (Q13) usage¶
I believe that everyone has downloaded a lot of software on their computers, and has configured one environment after another. Let's take a look at which integrated development environment is the most popular.
tool=['JupyterLab','RStudio','Visual Studio','Visual Studio Code (VSCode)','PyCharm','Spyder','Notepad++','Sublime Text','Vim / Emacs','MATLAB','Jupyter Notebook','IntelliJ','None','Other']
count=[4887,3824,4416,9976,6099,2880,3891,2218,1448,2302,13684,1612,409,1474]
tools=pd.DataFrame({'Tool':tool,'Count':count})
tools=tools.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(tools,x='Tool',y='Count',color='Tool',text='Count',template='simple_white',title='<b>IDEs used by Kagglers on a regular basis</b>')
fig.update_layout(title_x=0.5)
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
The visualization of question Q13 shows that Jupyter Notebook took the top spot without any suspense. Jupyter's one-code-block, one-run workflow saves the trouble of juggling many separate files, and visualization results are displayed inline. It is powerful, and its popularity is well deserved.
a1=data_1.groupby(['Q13_11','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['Ide']=a1['Q13_11']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
a2=data_1.groupby(['Q13_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ide']=a2['Q13_4']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q13_5','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ide']=a3['Q13_5']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q13_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ide']=a4['Q13_1']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Jupyter Notebook</em>','<em>Visual Studio Code (VSCode)</em>','<em>PyCharm</em>','<em>JupyterLab</em>'))
# Filter each frame first so that x, y, and text all come from the same rows
# (the comparison strings keep the stray spaces present in the raw survey values)
b1=a1[a1['Ide']==' Jupyter Notebook']
b2=a2[a2['ide']==' Visual Studio Code (VSCode) ']
b3=a3[a3['ide']==' PyCharm ']
b4=a4[a4['ide']=='JupyterLab ']
fig.add_trace(go.Bar(x=b1['Year'],y=b1['Count'],name='Jupyter Notebook',text=b1['Count']),row=1,col=1)
fig.add_trace(go.Bar(x=b2['Year'],y=b2['Count'],name='Visual Studio Code (VSCode)',text=b2['Count']),row=1,col=2)
fig.add_trace(go.Bar(x=b3['Year'],y=b3['Count'],name='PyCharm',text=b3['Count']),row=2,col=1)
fig.add_trace(go.Bar(x=b4['Year'],y=b4['Count'],name='JupyterLab',text=b4['Count']),row=2,col=2)
fig.show()
We observe that in the eyes of programmers of every level of experience, Jupyter Notebook is still the top choice: it is friendly to novices while also meeting the needs of programming experts.
3-2-6 Usage of data visualization libraries (Q15)¶
Data visualization is an indispensable part of data analysis. Through visualization, dry, impersonal data can be understood by anyone, whether or not they have a programming background. This is essential for writing reports and for discussing projects clearly with your boss.
library=['Matplotlib','Seaborn','Plotly / Plotly Express','Ggplot / ggplot2','Shiny','D3 js','Altair','Bokeh','Geoplotlib','Leaflet / Folium','Pygal','Dygraphs','Highcharter','None','Other']
count=[14010,10512,5078,4145,1043,734,300,771,1167,554,318,225,198,3439,691]
df8=pd.DataFrame({'Library':library,'Count':count})  # use a new name so the raw `data` frame is not overwritten
df8=df8.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(df8,x='Library',y='Count',template='simple_white',color='Library',text='Count',title='<b>Data visualization libraries used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
The visualization of question Q15 shows that Matplotlib is the most commonly used data visualization library; it covers all the common chart-drawing functions.
a1=data_1.groupby(['Q15_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['dv']=a1['Q15_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
a2=data_1.groupby(['Q15_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['dv']=a2['Q15_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q15_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['dv']=a3['Q15_3']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q15_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['dv']=a4['Q15_4']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Matplotlib</em>','<em>Seaborn</em>','<em>Plotly / Plotly Express</em>','<em>Ggplot / ggplot2</em>'))
# Filter once so that x, y and text all come from the same rows
mpl = a1[a1['dv']==' Matplotlib ']
fig.add_trace(go.Bar(x=mpl['Year'], y=mpl['Count'], name='Matplotlib', text=mpl['Count']), row=1, col=1)
sbn = a2[a2['dv']==' Seaborn ']
fig.add_trace(go.Bar(x=sbn['Year'], y=sbn['Count'], name='Seaborn', text=sbn['Count']), row=1, col=2)
plo = a3[a3['dv']==' Plotly / Plotly Express ']
fig.add_trace(go.Bar(x=plo['Year'], y=plo['Count'], name='Plotly / Plotly Express', text=plo['Count']), row=2, col=1)
ggp = a4[a4['dv']==' Ggplot / ggplot2 ']
fig.add_trace(go.Bar(x=ggp['Year'], y=ggp['Count'], name='Ggplot / ggplot2', text=ggp['Count']), row=2, col=2)
fig.show()
As with the integrated environments, Matplotlib is popular with novices and veterans alike, followed by Seaborn.
3-2-7 Machine Learning Framework (Q17) Usage¶
A machine learning framework is both a guide for students and an indispensable helper for practitioners: by training and reusing existing models, the same toolkit can produce results for very different problems. The framework a person chooses shapes both the depth and the breadth of what they can do.
work=['Scikit-learn','TensorFlow','Keras','PyTorch','Fast.ai','Xgboost','LightGBM','CatBoost','Caret','Tidymodels','JAX','PyTorch Lightning','Huggingface','None','Other']
count=[11403,7953,6575,5191,648,4477,1940,1165,821,547,252,1013,1332,1709,620]
d1=pd.DataFrame({'Frameworks':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Frameworks',y='Count',color='Frameworks',text='Count',template='simple_white',title='<b>Machine learning frameworks used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
The visualization of Q17 shows that scikit-learn is the most commonly used machine learning framework; it bundles a wide variety of models, many of which we met in this semester's statistical learning theory course.
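To illustrate why that uniform interface is so popular, a hedged sketch of the scikit-learn fit/predict pattern (toy data of our own, not the survey set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data with an exactly linear relationship: y = 3x + 1
X = np.arange(10).reshape(-1, 1)
y = 3 * X.ravel() + 1

# Every scikit-learn estimator shares the same fit/predict interface
preds = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X, y)
    preds[type(model).__name__] = model.predict([[4]])[0]
print(preds)  # both recover y(4) = 13
```

Swapping one estimator for another changes a single line, which is exactly what makes model comparison (as in section 4-3 below) so cheap.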
a1=data_1.groupby(['Q17_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q17_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q17_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q17_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q17_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q17_3']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q17_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q17_4']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Scikit-learn</em>','<em>TensorFlow</em>','<em>Keras</em>','<em>PyTorch</em>'))
# Filter once so that x, y and text all come from the same rows
skl = a1[a1['ml']==' Scikit-learn ']
fig.add_trace(go.Bar(x=skl['Year'], y=skl['Count'], name='Scikit-learn', text=skl['Count']), row=1, col=1)
tf = a2[a2['ml']==' TensorFlow ']
fig.add_trace(go.Bar(x=tf['Year'], y=tf['Count'], name='TensorFlow', text=tf['Count']), row=1, col=2)
krs = a3[a3['ml']==' Keras ']
fig.add_trace(go.Bar(x=krs['Year'], y=krs['Count'], name='Keras', text=krs['Count']), row=2, col=1)
pt = a4[a4['ml']==' PyTorch ']
fig.add_trace(go.Bar(x=pt['Year'], y=pt['Count'], name='PyTorch', text=pt['Count']), row=2, col=2)
fig.show()
scikit-learn covers a remarkable range: whether you are fitting your first model or a seasoned user tuning a pipeline, it has an estimator for the job, across many fields and problem types.
3-2-8 Use of machine learning algorithms (Q18)¶
Machine learning algorithms are inseparable from their mathematical derivations. In class we walked through the derivations of logistic regression, decision trees, Bayesian classifiers, and others; only by understanding a model at the mathematical level can we pick the right one for a given task. This matters a great deal.
work=['Linear or Logistic Regression','Decision Trees or Random Forests','Gradient Boosting Machines (xgboost, lightgbm, etc)','Bayesian Approaches','Evolutionary Approaches','Dense Neural Networks (MLPs, etc)','Convolutional Neural Networks','Generative Adversarial Networks','Recurrent Neural Networks','Transformer Networks (BERT, gpt-3, etc)','Autoencoder Networks (DAE, VAE, etc)','Graph Neural Networks','None','Other']
count=[11338,9373,5506,3661,823,3476,6006,1166,3451,2196,1234,1422,1326,538]
d1=pd.DataFrame({'Algorithms':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Algorithms',y='Count',color='Algorithms',text='Count',template='simple_white',title='<b>ML algorithms used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
Linear or logistic regression turns out to be the most commonly used machine learning algorithm. Many real-world problems are (close to) linear, and multi-class problems can be decomposed into several binary logistic regressions: as taught in class, a multi-class problem is the combination of multiple binary classification problems (one-vs-rest).
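The one-vs-rest decomposition described above can be sketched with scikit-learn (toy 1-D data of our own; LogisticRegression also handles several classes natively, the wrapper just makes the decomposition explicit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three well-separated 1-D clusters -> a 3-class problem
X = np.array([[0.0], [0.2], [5.0], [5.2], [10.0], [10.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary logistic regression is trained per class;
# prediction picks the class whose binary model is most confident
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))                # 3 binary classifiers
print(ovr.predict([[0.1], [5.1], [9.9]]))  # one label per cluster
```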
a1=data_1.groupby(['Q18_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q18_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q18_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q18_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q18_7','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q18_7']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q18_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q18_3']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Linear or Logistic Regression</em>','<em>Decision Trees or Random Forests</em>','<em>Convolutional Neural Networks</em>','<em>Gradient Boosting Machines</em>'))
# Filter once so that x, y and text all come from the same rows
lr = a1[a1['ml']=='Linear or Logistic Regression']
fig.add_trace(go.Bar(x=lr['Year'], y=lr['Count'], name='Linear or Logistic Regression', text=lr['Count']), row=1, col=1)
dt = a2[a2['ml']=='Decision Trees or Random Forests']
fig.add_trace(go.Bar(x=dt['Year'], y=dt['Count'], name='Decision Trees or Random Forests', text=dt['Count']), row=1, col=2)
cnn = a3[a3['ml']=='Convolutional Neural Networks']
fig.add_trace(go.Bar(x=cnn['Year'], y=cnn['Count'], name='Convolutional Neural Networks', text=cnn['Count']), row=2, col=1)
gbm = a4[a4['ml']=='Gradient Boosting Machines (xgboost, lightgbm, etc)']
fig.add_trace(go.Bar(x=gbm['Year'], y=gbm['Count'], name='Gradient Boosting Machines', text=gbm['Count']), row=2, col=2)
fig.show()
Linear and logistic regression dominate among beginners, presumably because linear problems are simpler than nonlinear ones. Among programming veterans, we suspect it is because many problems either can be solved with (logistic) regression outright, or become easier once decomposed into linear pieces.
3-2-9 Use of computer vision methods (Q19)¶
Computer vision is applied very widely, with uses in medicine, agriculture, autonomous driving, and other fields. Let's look at which computer vision methods are most commonly used.
work=['General purpose image/video tools','Image segmentation methods','Object detection methods','Image classification and other general purpose networks','Generative Networks','Vision transformer networks','None','Other']
count=[2293,2495,2525,3664,1343,782,1455,146]
d1=pd.DataFrame({'Vision Methods':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Vision Methods',y='Count',color='Vision Methods',text='Count',template='simple_white',title='<b>Computer vision methods used on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
Image classification and other general-purpose networks are the most used computer vision methods. Image classification also featured heavily in our artificial intelligence course; once the relationship between artificial intelligence, machine learning, and deep learning is clear, the learning path becomes much clearer too.
a1=data_1.groupby(['Q19_4','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q19_4']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q19_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q19_3']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q19_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q19_2']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q19_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q19_1']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>Image classification</em>','<em>Object detection methods</em>','<em>Image segmentation methods</em>','<em>General purpose image/video tools</em>'))
# Filter once so that x, y and text all come from the same rows
ic = a1[a1['ml']=='Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)']
fig.add_trace(go.Bar(x=ic['Year'], y=ic['Count'], name='Image classification (VGG, Inception, ResNet)', text=ic['Count']), row=1, col=1)
od = a2[a2['ml']=='Object detection methods (YOLOv6, RetinaNet, etc)']
fig.add_trace(go.Bar(x=od['Year'], y=od['Count'], name='Object detection methods (YOLOv6, RetinaNet, etc)', text=od['Count']), row=1, col=2)
seg = a3[a3['ml']=='Image segmentation methods (U-Net, Mask R-CNN, etc)']
fig.add_trace(go.Bar(x=seg['Year'], y=seg['Count'], name='Image segmentation methods (U-Net, Mask R-CNN)', text=seg['Count']), row=2, col=1)
gp = a4[a4['ml']=='General purpose image/video tools (PIL, cv2, skimage, etc)']
fig.add_trace(go.Bar(x=gp['Year'], y=gp['Count'], name='General purpose image/video tools (PIL, cv2)', text=gp['Count']), row=2, col=2)
fig.show()
3-2-10 Cloud platform (Q32) usage¶
The amount of data is always a problem. Using products from major cloud platforms facilitates data storage and retrieval.
Num_Q32 = data_1["Q32"].value_counts()
print(Num_Q32)
fig=px.pie(values=Num_Q32.values, names=Num_Q32.index, hole=0.4, template='ggplot2')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.update_layout(
    title={
        "text":"Cloud platform usage distribution",
        "y":0.96,  # title position along the y axis
        "x":0.45,  # title position along the x axis
    }
)
fig.show()
Q32
Amazon Web Services (AWS)                                  555
Google Cloud Platform (GCP)                                501
They all had a similarly enjoyable developer experience    443
Microsoft Azure                                            256
None were satisfactory                                      72
IBM Cloud / Red Hat                                         34
Oracle Cloud                                                20
Other                                                       20
VMware Cloud                                                12
SAP Cloud                                                    7
Alibaba Cloud                                                5
Tencent Cloud                                                3
Huawei Cloud                                                 1
Name: count, dtype: int64
3-2-11 Database (Q35) usage¶
Databases are closely tied to machine learning work: while querying, we can inspect the data's characteristics and perform operations such as filtering rows and selecting features.
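As a minimal illustration of that kind of screening, an in-memory sqlite3 sketch (the table and column names are our own invention, not from the survey):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (year TEXT, salary REAL)")
con.executemany("INSERT INTO survey VALUES (?, ?)",
                [("2020", 45000.0), ("2021", 60000.0), ("2022", 30000.0)])

# Screen rows before they ever reach pandas: keep only salaries >= 40000
rows = con.execute(
    "SELECT year, salary FROM survey WHERE salary >= 40000 ORDER BY salary"
).fetchall()
print(rows)  # [('2020', 45000.0), ('2021', 60000.0)]
```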
work=['MySQL','PostgreSQL ','SQLite ','Oracle Database ','MongoDB ','Snowflake ','IBM Db2 ','Microsoft SQL Server ','Microsoft Azure SQL Database ','Amazon Redshift ','Amazon RDS ','Amazon DynamoDB ','Google Cloud BigQuery ','Google Cloud SQL ','None','Other']
count=[2233,1516,1159,688,1031,399,192,1203,520,380,505,356,690,439,955,217]
d1=pd.DataFrame({'Database':work,'Count':count})
d1=d1.sort_values(by='Count',ascending=False).reset_index(drop=True)
fig=px.bar(d1,x='Database',y='Count',color='Database',text='Count',template='simple_white',title='<b>Databases used by kagglers on a regular basis</b>')
fig.update_traces(marker=dict(line=dict(color='#000000', width=1.6)))
fig.show()
a1=data_1.groupby(['Q35_1','Q11'],as_index=False)['Duration (in seconds)'].count()
a1['ml']=a1['Q35_1']
a1['Year']=a1['Q11']
a1['Count']=a1['Duration (in seconds)']
# a.drop(['Q2','Q23','Duration (in seconds)'],axis=1,inplace=True)
a2=data_1.groupby(['Q35_2','Q11'],as_index=False)['Duration (in seconds)'].count()
a2['ml']=a2['Q35_2']
a2['Year']=a2['Q11']
a2['Count']=a2['Duration (in seconds)']
a3=data_1.groupby(['Q35_8','Q11'],as_index=False)['Duration (in seconds)'].count()
a3['ml']=a3['Q35_8']
a3['Year']=a3['Q11']
a3['Count']=a3['Duration (in seconds)']
a4=data_1.groupby(['Q35_3','Q11'],as_index=False)['Duration (in seconds)'].count()
a4['ml']=a4['Q35_3']
a4['Year']=a4['Q11']
a4['Count']=a4['Duration (in seconds)']
fig=make_subplots(rows=2,cols=2,subplot_titles=('<em>MySQL</em>','<em>PostgreSQL</em>','<em>Microsoft SQL Server</em>','<em>SQLite</em>'))
# Filter once so that x, y and text all come from the same rows
my = a1[a1['ml']=='MySQL ']
fig.add_trace(go.Bar(x=my['Year'], y=my['Count'], name='MySQL', text=my['Count']), row=1, col=1)
pg = a2[a2['ml']=='PostgreSQL ']
fig.add_trace(go.Bar(x=pg['Year'], y=pg['Count'], name='PostgreSQL', text=pg['Count']), row=1, col=2)
ms = a3[a3['ml']=='Microsoft SQL Server ']
fig.add_trace(go.Bar(x=ms['Year'], y=ms['Count'], name='Microsoft SQL Server', text=ms['Count']), row=2, col=1)
sq = a4[a4['ml']=='SQLite ']
fig.add_trace(go.Bar(x=sq['Year'], y=sq['Count'], name='SQLite', text=sq['Count']), row=2, col=2)
fig.show()
4. Data prediction¶
Having browsed all these visual analyses, we now have a good picture of practitioners' basic situation. But let's not forget that the original motivation for this analysis is a rather more practical question.
Whatever else is true, supporting a family comes first. Everyone says the computer industry pays well, so let's follow the data and dig into how the industry really looks. This is, after all, the last step in becoming a talent.
Knowledge is important, but knowledge that earns money is more important. So which knowledge is actually tied to earning it?
What characteristics or skills do we need to have?
Let's take these questions and find out!
4-1 Selection of independent and dependent variables¶
4-1-1 Dependent variable - transformation of salary field¶
Salary serves as the dependent variable, but there is a catch: the raw field is a series of numeric ranges whose widths are not equal. Treated as a classification problem, the field has three main defects: the class intervals differ greatly in width, they cannot cover every possible salary, and respondents have no intuitive sense of which bucket their salary belongs to.
Framing the task as a regression problem instead effectively alleviates all three defects.
# Replace each salary range with its midpoint, e.g. "$0-999", "200,000-249,999", ">$1,000,000"
data_1['Q29'] = data_1['Q29'].str.replace('$','').str.replace(',','').str.replace('>','1000000-')
data_1[['Sal_l', 'Sal_h']]=data_1['Q29'].str.split('-', n=1, expand=True)  # n: number of splits; expand: return a DataFrame
# data_1.dtypes
data_1['Sal_l'] = pd.to_numeric(data_1['Sal_l'])
data_1['Sal_h'] = pd.to_numeric(data_1['Sal_h'])
data_1['Salary'] = round((data_1['Sal_l'] + data_1['Sal_h']) / 2)
# print(data_1['Salary'])
data_1.drop(["Sal_l","Sal_h","Q29"], axis=1, inplace=True)
Remove outliers
We did not remove outliers on our first training pass, and the results were poor. After consulting the literature we learned that outliers distort model training, so these "bad apples" must be removed before they can contaminate the later results.
fig = px.box(
    data_1,
    y="Salary",
    points="all"
)
fig.show()
def outlier(data):
    # IQR rule: an observation is an outlier if it falls outside
    # [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    Obs_min = Q1 - 1.5*IQR
    Obs_max = Q3 + 1.5*IQR
    result = (data < Obs_min) | (data > Obs_max)  # boolean outlier mask
    return data[result].index  # index labels of the outlier rows
# outlier(data_1['Salary'])
data_1.drop(list(outlier(data_1.Salary)), axis=0, inplace=True)  # 23617 rows remain
# data_1
data_1 = data_1.reset_index(drop=True)
4-1-2 Data processing¶
Now that we have dealt with the dependent variable, let's take a look at the characteristics of the independent variable.
Our data set consists of single-choice questions and multiple-choice questions, so correspondingly, multiple-choice and single-choice questions should have different processing methods.
Deduplication
data_1.drop_duplicates(inplace=True)
data_1.shape
(23617, 296)
About handling missing values
This turned out to be trickier than expected. For a multiple-choice question, an empty cell merely means the respondent did not tick that option, so those "missing" values cannot simply be dropped; we treat them as 0. Genuinely missing answers to single-choice questions are absorbed by the one-hot encoding below.
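A toy sketch of the two strategies, with made-up column names that mirror the survey layout (Q6_1 standing in for a multiple-choice option column, Q4 for a single-choice question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Q6_1": ["Coursera", np.nan, "Coursera"],  # multiple-choice option: NaN means "not ticked"
    "Q4":   ["India", np.nan, "USA"],          # single-choice: NaN is a genuinely missing answer
})

# Multiple-choice: ticked -> 1, blank -> 0
df["Q6_1"] = df["Q6_1"].notna().astype(int)

# Single-choice: one-hot encode; the row with a missing answer simply gets all zeros
df = pd.get_dummies(df, columns=["Q4"])
print(df)
```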
Encoding
def column_name(name):
return [col for col in data_1.columns if name in col]
data = column_name('_')
data_cod = data_1[data]
data_c = ['Q2','Q3','Q4','Q5','Q8','Q9','Q11','Q16','Q22','Q23','Q24','Q25','Q26','Q27','Q30','Q32','Q43']
data_1 = pd.get_dummies(data=data_1, columns=data_c)
data_1.head()
| Duration (in seconds) | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | Q6_6 | Q6_7 | Q6_8 | Q6_9 | ... | Q32_ Tencent Cloud | Q32_ VMware Cloud | Q32_None were satisfactory | Q32_Other | Q32_They all had a similarly enjoyable developer experience | Q43_2-5 times | Q43_6-25 times | Q43_More than 25 times | Q43_Never | Q43_Once | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 121 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 462 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 293 | Coursera | edX | NaN | DataCamp | NaN | Udacity | Udemy | LinkedIn Learning | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 851 | Coursera | NaN | Kaggle Learn Courses | NaN | NaN | NaN | Udemy | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 232 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 461 columns
The code below binarizes the multiple-choice columns: any recorded answer becomes 1 and an empty cell becomes 0.
#multiple-choice columns
listem = []
for i in data_cod.columns:
    listem.append(data_1[i].dropna().unique()[0])  # the option text recorded for this column
data_1[data] = data_1[data].replace(np.nan, 0).replace(listem, 1)  # ticked -> 1, blank -> 0
data_1.head()
| Duration (in seconds) | Q6_1 | Q6_2 | Q6_3 | Q6_4 | Q6_5 | Q6_6 | Q6_7 | Q6_8 | Q6_9 | ... | Q32_ Tencent Cloud | Q32_ VMware Cloud | Q32_None were satisfactory | Q32_Other | Q32_They all had a similarly enjoyable developer experience | Q43_2-5 times | Q43_6-25 times | Q43_More than 25 times | Q43_Never | Q43_Once | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 121 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 462 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 293 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 851 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 232 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 461 columns
4-1-3 Independent variable selection¶
data_1 = data_1.dropna(subset=['Salary'], how='any')  # how: 'any' or 'all', default 'any'
X = data_1.drop('Salary', axis=1)
y = data_1['Salary']
print(y.shape, X.shape)
# X.to_excel("bigwork.xlsx")
(7756,) (7756, 460)
After all this processing, a pile of redundant variables remains. Which ones should we use for prediction?
While studying, we came across sklearn's SelectKBest function, which solves this problem neatly.
SelectKBest takes two parameters: score_func, a function that scores each feature so that features can be ranked from high to low, and k, which caps how many features are kept (10 by default).
#SelectKBest
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')  # alternative: score_func=mutual_info_regression
fit = fs.fit(X, y)
feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: one score per feature
top20_feature = feature_imp.nlargest(n=20, columns=['Score'])
plt.figure(figsize=(8,5))
g = sns.barplot(y=top20_feature.index, x=top20_feature['Score'])
p = plt.title('Top 20 features by F-score')
p = plt.xlabel('F-score')
p = plt.ylabel('Feature')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
Having selected the top 20 features, we want to see how they relate to one another; visualization is the most intuitive and fastest way.
# Diverging color palette
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top20_feature.index].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
# ones_like: array of ones with the same shape as corr; triu: mask for the upper triangle
g = sns.heatmap(corr, annot=True, vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')
During the subsequent prediction step we found that too many variables are themselves a headache: extra, weakly informative features introduce noise that drags prediction quality down.
So when selecting features we varied k, comparing the top 5, top 10, and top 20 feature sets to find the best number of variables.
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')
fit = fs.fit(X, y)
feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: one score per feature
top5_feature = feature_imp.nlargest(n=5, columns=['Score'])
plt.figure(figsize=(8,5))
g = sns.barplot(y=top5_feature.index, x=top5_feature['Score'])
p = plt.title('Top 5 features by F-score')
p = plt.xlabel('F-score')
p = plt.ylabel('Feature')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top5_feature.index].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
g = sns.heatmap(corr, annot=True, vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')
from sklearn.feature_selection import SelectKBest, f_regression
fs = SelectKBest(score_func=f_regression, k='all')
fit = fs.fit(X, y)  # fit on feature matrix X and target y
feature_imp = pd.DataFrame(fs.scores_, columns=['Score'], index=X.columns)  # scores_: one score per feature
top10_feature = feature_imp.nlargest(n=10, columns=['Score'])
plt.figure(figsize=(8,5))
g = sns.barplot(y=top10_feature.index, x=top10_feature['Score'])
p = plt.title('Top 10 features by F-score')
p = plt.xlabel('F-score')
p = plt.ylabel('Feature')
p = g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right')
cmap = sns.diverging_palette(h_neg=100, h_pos=200, s=80, l=55, n=9)
plt.figure(figsize=(15, 15))
corr = X[top10_feature.index].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
g = sns.heatmap(corr, annot=True, vmax=0.3, cmap=cmap, mask=mask, square=True, linewidths=0.05)
p = plt.title('Correlation matrix')
As the correlation matrices show, some of the selected features are highly correlated with each other, so we remove two of them, as in the following code.
# X = X[top20_feature.index]
# list(X.columns)
# X = X[top5_feature.index]
# list(X.columns)
X = X[top10_feature.index]
X.drop(['Q33_1','Q34_3'], axis=1, inplace=True)
list(X.columns)
['Q4_United States of America', 'Q4_India', 'Q28_3', 'Q27_We have well established ML methods (i.e., models in production for more than 2 years)', 'Q16_5-10 years', 'Q11_20+ years', 'Q31_1', 'Q28_5']
4-2 Dataset Partitioning¶
# Divide the dataset
from sklearn.preprocessing import MinMaxScaler#min-max
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
# scaler = MinMaxScaler()
# scaler = scaler.fit(X)
# X = scaler.transform(X)
# scaler = StandardScaler()
# X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
Configuration used: min-max scaling + test_size=0.15.
4-3 Model Training¶
#Random Forest
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("RandomForestRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
RandomForestRegressor results: Training set score: 0.9401072528160039 Validation set score: 0.5972944686721949
#Linear Regression
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("LinearRegression results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
LinearRegression results: Training set score: 0.6464430857938157 Validation set score: -3.861692249061065e+25
#Ridge Regression
from sklearn.linear_model import Ridge
clf = Ridge()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("Ridge results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
Ridge results: Training set score: 0.6465647424733079 Validation set score: 0.6090348828938872
#Lasso Regression
from sklearn.linear_model import Lasso
clf = Lasso()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("Lasso results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
Lasso results: Training set score: 0.646564058829653 Validation set score: 0.6091747030930386
#Decision Tree
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("DecisionTreeRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
DecisionTreeRegressor results: Training set score: 1.0 Validation set score: 0.18094718841028012
#Bagging Regression Model
from sklearn.ensemble import BaggingRegressor
clf = BaggingRegressor()
rf = clf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)
print("BaggingRegressor results:")
print("Training set score:", rf.score(X_train, y_train))
print("Validation set score:", rf.score(X_test, y_test))
BaggingRegressor results: Training set score: 0.9170855188835699 Validation set score: 0.5450183800537018
5 Results Analysis¶
From the training results above, all of the models perform poorly. We think the likely reasons are as follows:
The data volume is modest: after dropping rows with missing salary, only 7,756 observations remain.
We replaced each salary range with its midpoint. The ranges are wide, so the midpoint can be far from the true salary, which by itself caps the achievable accuracy.
Several models score very high on the training set but very low on the validation set, which suggests overfitting.
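One way to probe the overfitting suspicion further is k-fold cross-validation instead of a single split. A hedged sketch on synthetic data (our own, since the processed survey frame is not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression task: y depends linearly on one of eight features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Five R^2 scores, each on a held-out fold; a large gap between these
# and the training-set score is the classic overfitting signature
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring="r2")
print(scores.round(3))
```

Averaging across folds gives a far more stable estimate than the single 15% hold-out used above, at the cost of training the model k times.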